AITopics | annotation error

Collaborating Authors

annotation error

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks

Wang, Hao, Pan, Licheng, Chen, Zhichao, Zheng, Chunyuan, Chu, Zhixuan, Li, Xiaoxi, Lu, Yuan, Liu, Xinggao, Li, Haoxuan, Lin, Zhouchen

arXiv.org Machine LearningMar-20-2026

Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models, current reward modeling heavily relies on experimental feedback data collected from human annotators under controlled and costly conditions. In this work, we introduce observational reward modeling -- learning reward models with observational user feedback (e.g., clicks, copies, and upvotes) -- as a scalable and cost-effective alternative. We identify two fundamental challenges in this setting: (1) observational feedback is noisy due to annotation errors, which deviates it from true user preference; (2) observational feedback is biased by user preference, where users preferentially provide feedback on responses they feel strongly about, which creats a distribution shift between training and inference data. To address these challenges, we propose CausalRM, a causal-theoretic reward modeling framework that aims to learn unbiased reward models from observational feedback. To tackle challenge (1), CausalRM introduces a noise-aware surrogate loss term that is provably equivalent to the primal loss under noise-free conditions by explicitly modeling the annotation error generation process. To tackle challenge (2), CausalRM uses propensity scores -- the probability of a user providing feedback for a given response -- to reweight training samples, yielding a loss function that eliminates user preference bias. Extensive experiments across diverse LLM backbones and benchmark datasets validate that CausalRM effectively learns accurate reward signals from noisy and biased observational feedback and delivers substantial performance improvements on downstream RLHF tasks -- including a 49.2% gain on WildGuardMix and a 32.7% improvement on HarmBench. Code is available on our project website.

large language model, machine learning, natural language, (21 more...)

arXiv.org Machine Learning

2603.18736

Country: Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (0.50)

Industry: Law (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Appendix

Neural Information Processing SystemsFeb-8-2026, 08:06:05 GMT

We limit the target languages for this augmentation process to Arabic, Finnish, Japanese, Korean, Russian, Spanish, Swedish, Hebrew, Thai,Danish,French,Italian,Dutch,Polish,andPortuguese. Interestingly,justaddingthislanguage code effectively changes the outputs as shown in Table 7. We further subsample 50% of the synthetically generated questions. During inference, we first retrieve top 15 passages using mDPR, and then feed the questions andconcatenated passages intothemGEN model, withlanguage tags. The gray dots concentrated in the lower right part in the first figure represent encoded Thai embeddings.

artificial intelligence, state-of-the-art model, trans, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > California (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)

Technology: Information Technology > Artificial Intelligence (0.69)

Add feedback

EVADE: LLM-Based Explanation Generation and Validation for Error Detection in NLI

Zuo, Longfei, Plank, Barbara, Peng, Siyao

arXiv.org Artificial IntelligenceNov-13-2025

High-quality datasets are critical for training and evaluating reliable NLP models. In tasks like natural language inference (NLI), human label variation (HLV) arises when multiple labels are valid for the same instance, making it difficult to separate annotation errors from plausible variation. An earlier framework VARIERR (Weber-Genzel et al., 2024) asks multiple annotators to explain their label decisions in the first round and flag errors via validity judgments in the second round. However, conducting two rounds of manual annotation is costly and may limit the coverage of plausible labels or explanations. Our study proposes a new framework, EVADE, for generating and validating explanations to detect errors using large language models (LLMs). We perform a comprehensive analysis comparing human- and LLM-detected errors for NLI across distribution comparison, validation overlap, and impact on model fine-tuning. Our experiments demonstrate that LLM validation refines generated explanation distributions to more closely align with human annotations, and that removing LLM-detected errors from training data yields improvements in fine-tuning performance than removing errors identified by human annotators. This highlights the potential to scale error detection, reducing human effort while improving dataset quality under label variation.

explanation, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2511.08949

Country:

Europe (1.00)
Asia (0.93)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.93)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

53fde96fcc4b4ce72d7739202324cd49-AuthorFeedback.pdf

Neural Information Processing SystemsOct-2-2025, 18:03:07 GMT

artificial intelligence, correlation, test time, (13 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.50)

Add feedback

Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers

Seo, Wooseok, Han, Seungju, Jung, Jaehun, Newman, Benjamin, Lim, Seungwon, Lee, Seungbeen, Lu, Ximing, Choi, Yejin, Yu, Youngjae

arXiv.org Artificial IntelligenceJun-17-2025

Fact verification is essential for ensuring the reliability of LLM applications. In this study, we evaluate 12 pre-trained LLMs and one specialized fact-verifier, including frontier LLMs and open-weight reasoning LLMs, using a collection of examples from 14 fact-checking benchmarks. We share three findings intended to guide future development of more robust fact verifiers. First, we highlight the importance of addressing annotation errors and ambiguity in datasets, demonstrating that approximately 16\% of ambiguous or incorrectly labeled data substantially influences model rankings. Neglecting this issue may result in misleading conclusions during comparative evaluations, and we suggest using a systematic pipeline utilizing LLM-as-a-judge to help identify these issues at scale. Second, we discover that frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance. We therefore recommend future studies include comparisons with these simple yet highly effective baselines. Lastly, despite their effectiveness, frontier LLMs incur substantial costs, motivating the development of small, fine-tuned fact verifiers. We show that these small models still have room for improvement, particularly on instances that require complex reasoning. Encouragingly, we demonstrate that augmenting training with synthetic multi-hop reasoning data significantly enhances their capabilities in such instances. We release our code, model, and dataset at https://github.com/just1nseo/verifying-the-verifiers

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2506.13342

Country:

Asia (0.68)
North America > United States (0.46)

Genre: Research Report > New Finding (0.66)

Industry:

Health & Medicine (1.00)
Leisure & Entertainment > Sports > Motorsports (0.68)
Leisure & Entertainment > Sports > Soccer (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

Add feedback

RePOPE: Impact of Annotation Errors on the POPE Benchmark

Neuhaus, Yannic, Hein, Matthias

arXiv.org Artificial IntelligenceApr-23-2025

Since data annotation is costly, benchmark datasets often incorporate labels from established image datasets. In this work, we assess the impact of label errors in MSCOCO on the frequently used object hallucination benchmark POPE. We re-annotate the benchmark images and identify an imbalance in annotation errors across different subsets. Evaluating multiple models on the revised labels, which we denote as RePOPE, we observe notable shifts in model rankings, highlighting the impact of label quality. Code and data are available at https://github.com/YanNeu/RePOPE .

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2504.15707

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.98)
Information Technology > Artificial Intelligence > Natural Language (0.70)

Add feedback

Common Ground, Diverse Roots: The Difficulty of Classifying Common Examples in Spanish Varieties

Lopetegui, Javier A., Riabi, Arij, Seddah, Djamé

arXiv.org Artificial IntelligenceDec-16-2024

Variations in languages across geographic regions or cultures are crucial to address to avoid biases in NLP systems designed for culturally sensitive tasks, such as hate speech detection or dialog with conversational agents. In languages such as Spanish, where varieties can significantly overlap, many examples can be valid across them, which we refer to as common examples. Ignoring these examples may cause misclassifications, reducing model accuracy and fairness. Therefore, accounting for these common examples is essential to improve the robustness and representativeness of NLP systems trained on such data. In this work, we address this problem in the context of Spanish varieties. We use training dynamics to automatically detect common examples or errors in existing Spanish datasets. We demonstrate the efficacy of using predicted label confidence for our Datamaps \cite{swayamdipta-etal-2020-dataset} implementation for the identification of hard-to-classify examples, especially common examples, enhancing model performance in variety identification tasks. Additionally, we introduce a Cuban Spanish Variety Identification dataset with common examples annotations developed to facilitate more accurate detection of Cuban and Caribbean Spanish varieties. To our knowledge, this is the first dataset focused on identifying the Cuban, or any other Caribbean, Spanish variety.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2412.1175

Country:

North America > Cuba (0.06)
South America > Argentina (0.05)
Europe > Croatia > Dubrovnik-Neretva County > Dubrovnik (0.04)
(25 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Communications > Social Media (0.95)

Add feedback

SubRegWeigh: Effective and Efficient Annotation Weighing with Subword Regularization

Tsuji, Kohei, Hiraoka, Tatsuya, Cheng, Yuchang, Iwakura, Tomoya

arXiv.org Artificial IntelligenceSep-10-2024

Such a method to weigh annotation errors is recently Various NLP tasks exploit the pair of the raw text studied in the NER field. Wang et al. (2019) and the annotation label for training and evaluating proposed CrossWeigh which is the method for detecting models. For example of named entity recognition annotation errors in the dataset and adjusting (NER), which is applied to various practical technologies their learning priority by weighting loss values such as location detection (Inkpen et al., so that the training is not affected by such annotation 2017) and anonymization (Mamede et al., 2016), errors. However, there are shortcomings some parts of the text are annotated as named entities in its computational efficiency, especially in the (e.g., location names or personal names). And recent NLP trends with the pre-trained large language then, a model is trained to extract these entities models. We consider that the more efficient from the raw text. To achieve higher performance methods of annotation weighing can speed up the in NLP tasks, the models should be trained or finetuned development of NLP. In addition, reducing the computational with a sophisticated training dataset without cost contributes to Green AI (Schwartz annotation errors.

annotation error, computational linguistic, dataset, (14 more...)

arXiv.org Artificial Intelligence

2409.06216

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Japan (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(10 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Annotation Errors and NER: A Study with OntoNotes 5.0

Bernier-Colborne, Gabriel, Vajjala, Sowmya

arXiv.org Artificial IntelligenceJun-27-2024

Named Entity Recognition (NER) is a well-studied problem in NLP. However, there is much less focus on studying NER datasets, compared to developing new NER models. In this paper, we employed three simple techniques to detect annotation errors in the OntoNotes 5.0 corpus for English NER, which is the largest available NER corpus for English. Our techniques corrected ~10% of the sentences in train/dev/test data. In terms of entity mentions, we corrected the span and/or type of ~8% of mentions in the dataset, while adding/deleting/splitting/merging a few more. These are large numbers of changes, considering the size of OntoNotes. We used three NER libraries to train, evaluate and compare the models trained with the original and the re-annotated datasets, which showed an average improvement of 1.23% in overall F-scores, with large (>10%) improvements for some of the entity types. While our annotation error detection methods are not exhaustive and there is some manual annotation effort involved, they are largely language agnostic and can be employed with other NER datasets, and other sequence labelling tasks.

annotation error, dataset, entity type, (12 more...)

arXiv.org Artificial Intelligence

2406.19172

Country:

Asia > China > Hong Kong (0.04)
Oceania > Australia (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
(12 more...)

Genre: Research Report (0.64)

Industry: Government > Regional Government > North America Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.72)

Add feedback

VariErr NLI: Separating Annotation Error from Human Label Variation

Weber-Genzel, Leon, Peng, Siyao, de Marneffe, Marie-Catherine, Plank, Barbara

arXiv.org Artificial IntelligenceJun-6-2024

Human label variation arises when annotators assign different labels to the same item for valid reasons, while annotation errors occur when labels are assigned for invalid reasons. These two issues are prevalent in NLP benchmarks, yet existing research has studied them in isolation. To the best of our knowledge, there exists no prior work that focuses on teasing apart error from signal, especially in cases where signal is beyond black-and-white. To fill this gap, we introduce a systematic methodology and a new dataset, VariErr (variation versus error), focusing on the NLI task in English. We propose a 2-round annotation procedure with annotators explaining each label and subsequently judging the validity of label-explanation pairs. VariErr contains 7,732 validity judgments on 1,933 explanations for 500 re-annotated MNLI items. We assess the effectiveness of various automatic error detection (AED) methods and GPTs in uncovering errors versus human label variation. We find that state-of-the-art AED methods significantly underperform GPTs and humans. While GPT-4 is the best system, it still falls short of human performance. Our methodology is applicable beyond NLI, offering fertile ground for future research on error versus plausible variation, which in turn can yield better and more trustworthy NLP systems.

annotator, computational linguistic, explanation, (15 more...)

arXiv.org Artificial Intelligence

2403.01931

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Asia > Singapore (0.04)
(13 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback